We show that large-scale typicality of Markov sample paths implies that
the likelihood ratio statistic satisfies a law of iterated logarithm uniformly
to the same scale. As a consequence, the penalized likelihood Markov order
estimator is strongly consistent for penalties growing as slowly as $\log\log n$
when an upper bound is imposed on the order which may grow as rapidly as
$\log n$. Our method of proof, using techniques from empirical process theory,
does not rely on the explicit expression for the maximum likelihood estimator
in the Markov case and could therefore be applicable in other settings.

1. Introduction. For the purposes of this paper, a Markov chain is a discrete
time stochastic process $(X_k)_{k\ge 1}$, taking values in a state space $A$ of finite
cardinality $|A| < \infty$, such that the conditional law of $X_k$ given the past
$X_1,\ldots,X_{k-1}$ depends on the most recent $r$ states $X_{k-r},\ldots,X_{k-1}$ only.
The smallest number $r$ for which this assumption is satisfied is called the order of
the Markov chain. It is evident that the order of a Markov chain determines the most
parsimonious representation of the law of the process. Thus estimation of the order
from observed data is a problem of practical interest, which moreover raises
interesting mathematical questions at the intersection of probability, statistics and
information theory.

Denote by $P(x_{1:n})$ the probability of the sequence $x_{1:n} \in A^n$ under the law
$P$, and denote by $\Theta^r$ the collection of all laws of Markov chains whose order is
at most $r$. As the parameter spaces $\Theta^r \subset \Theta^{r+1}$ are increasing, the naive
maximum likelihood estimate of the order $\hat r_n = \mathrm{argmax}_r \sup_{P\in\Theta^r} P(x_{1:n})$
fails to be consistent. Instead, we introduce the penalized likelihood order estimator

$$\hat r_n = \mathop{\mathrm{argmax}}_{0\le r<\kappa(n)} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:n}) - \mathrm{pen}(n,r) \Big\},$$

where $\mathrm{pen}(n,r)$ is a penalty function and $\kappa(n)$ is a cutoff function. The estimator
is called strongly consistent if $\hat r_n \to r^*$ $P^*$-a.s. as $n\to\infty$ whenever the law of
the observations $P^*$ is the law of a Markov chain whose order is $r^*$. We aim to
understand which penalties and cutoffs yield a strongly consistent estimator.



AMS2000 subject classifications: Primary 62M05; secondary 60E15, 60F15, 60G42, 60J10 
Keywords and phrases: order estimation, uniform law of iterated logarithm, martingale
inequalities, empirical process theory, large-scale typicality, Markov chains




RAMON VAN HANDEL 



Results of this type date back to Finesso who considers the case where the
order $r^*$ of the Markov chain $P^*$ is known a priori to be bounded above by some
constant $r^* < \kappa$. In this setting, Finesso shows that the penalty and cutoff

$$\mathrm{pen}(n,r) = C|A|^r \log\log n, \qquad \kappa(n) = \kappa$$

yield a strongly consistent order estimator for a sufficiently large constant $C$ (by
|[T]], p. 592, it suffices to choose $C > 2|A|$). It can be argued from the law of
iterated logarithm for martingales that a penalty of this form is the minimal penalty
that achieves strong consistency, so that the result is essentially optimal (in the
sense that the probability of underestimation of the order is minimized). However,
the requirement imposed by the knowledge of an a priori upper bound on the order
is a significant drawback and is unrealistic in many applications.

Order estimation in the absence of an upper bound has been investigated, for
example, by Kieffer [Q]. However, the penalty used there is significantly larger than
the minimal penalty in the case of an a priori upper bound. Kieffer's conjecture that
the well known BIC penalty $\mathrm{pen}(n,r) = \frac{1}{2}|A|^r(|A|-1)\log n$ yields a strongly
consistent order estimator was proved by Csiszár and Shields [^]. The best result
to date, due to Csiszár [Q], shows that the penalty and cutoff

$$\mathrm{pen}(n,r) = c|A|^r \log n, \qquad \kappa(n) = \infty$$

yield a strongly consistent order estimator for any choice of the constant $c > 0$.
However, this penalty is still larger than the minimal penalty obtained by Finesso
in the case of an a priori upper bound on the order. These results raise a basic
question [Q, []]: is the $\log n$ growth of the penalty the necessary price to be paid
for the lack of a prior upper bound on the order, or is the minimal possible penalty
$\log\log n$ already sufficient for consistency in the absence of a prior upper bound?

1.1. Results of this paper. The purpose of this paper is twofold. 

First, we will show that a penalty of order $\log\log n$ does indeed suffice for
consistency of the Markov order estimator, provided we impose a cutoff of order
$\kappa(n) \sim \log n$. Remarkably, this is precisely the same cutoff as is required to
establish the consistency of minimum description length (MDL) order estimators
||2|], of which the BIC penalty is an approximation. As the $\log\log n$ penalty is much
smaller than the BIC penalty for large $n$, this constitutes a significant improvement
over previous results. However, the basic question posed above is only partially
resolved, as our results fall short of establishing consistency of the $\log\log n$ penalty
in the absence of a cutoff $\kappa(n) = \infty$ as is done in [[| [J for the BIC penalty.

Second, we introduce a new approach for proving consistency of order estimators
in the absence of a prior upper bound on the order. The techniques used in
previous work [Q Q] rely heavily on rather delicate explicit computations which
exploit the availability of a closed form expression for the maximum likelihood
estimator in the Markov case. In contrast, our method of proof, which uses techniques
from empirical process theory [|^, ^, is entirely different and can be applied much
more generally. The present approach could therefore provide a possible starting
point for extending the results of Csiszár and Shields to problems where an explicit
expression for the maximum likelihood is not available, such as the challenging
problem of order estimation in hidden Markov models (see Chapter 15).

1.2. Comparison with the approach of Csiszár and Shields. A direct consequence
of our main result is that the penalty and cutoff

$$\mathrm{pen}(n,r) = C^*|A|^r \log\log n, \qquad \kappa(n) = \alpha^*\log n$$

with suitable constants $C^*$ and $\alpha^*$, where $\alpha^*$ depends on the observation law $P^*$,
yield a strongly consistent penalized likelihood estimator (in order to obtain a
strongly consistent order estimator which does not require prior knowledge of $P^*$ it
suffices to choose $\kappa(n) = o(\log n)$). The upper bound $\kappa(n) = \alpha^*\log n$ is inherited
directly from the large-scale typicality property which plays a central role also in
||2|, []]. Our main result states that if large-scale typicality holds with an upper bound
$r < \kappa(2n)$ on the order, then the likelihood ratio statistic satisfies a law of iterated
logarithm uniformly for $r < \kappa(n)$ (the details are in the following section). Strong
consistency of the penalized likelihood order estimator then follows directly.

It is instructive to make a comparison with the approach of [Q, Q] for the penalty
$\mathrm{pen}(n,r) = c|A|^r\log n$. The proof of strong consistency in this setting consists
of two parts. First, large-scale typicality is used to prove strong consistency of the
estimator with cutoff $\kappa(n) = \alpha^*\log n$. Next, a separate argument is employed to
show that the larger orders $r \ge \alpha^*\log n$ are negligible. Our result improves the
first part of the proof, as we show that the conclusion already holds for the smaller
penalty $\mathrm{pen}(n,r) = C^*|A|^r\log\log n$. However, the second part of the proof is
missing in our setting, and it is unclear whether such a result could in fact be
established. The resolution of this problem should effectively identify the minimal
penalty for Markov order estimation in the absence of a cutoff.

Let us also note that the first part of the proof in [|~|] makes use of a sort of 
truncated law of iterated logarithm for the empirical transition probabilities of the 
Markov chain. However, the result in [[]] implies that the likelihood ratio statistic 
grows as log log n only for orders as large as log log n, while the bound grows as 
log n for orders as large as log n. Our main result shows that such a bound is not 
the best possible, resolving in the negative a question posed in ||2|], p. 1621. 

1.3. Organization of the paper. In Section 2, we set up the notation to be used
throughout the paper and state our main results. In Section 3, we reduce the proof
of our main result to the problem of establishing a suitable deviation bound. The
requisite deviation bound is proved in Section 4. The proof is based on an extension
of a maximal inequality of van de Geer [|^], which can be found in the Appendix.

2. Main results. Let us fix once and for all the alphabet $A$ of finite cardinality
$|A| < \infty$ and the canonical space $\Omega = A^{\mathbb{N}}$ endowed with its Borel $\sigma$-field and
coordinate process $(X_k)_{k\ge 1}$ ($X_k(\omega) = \omega(k)$ for $\omega\in\Omega$). We will write $x_{m:n}$ for
a sequence $(x_m,\ldots,x_n) \in A^{n-m+1}$. Moreover, for any probability measure $P$
on $\Omega$, we will write $P(x_{m:n})$ and $P(x_{m:n}|x_{r:s})$ instead of $P(X_{m:n} = x_{m:n})$ and
$P(X_{m:n} = x_{m:n}|X_{r:s} = x_{r:s})$, respectively, whenever no confusion can arise.

A Markov chain is defined by a probability measure $P$ such that for some $r \ge 0$

$$P(x_{1:n}) = P(x_{1:r}) \prod_{i=r+1}^{n} P(x_i|x_{i-r:i-1}) \quad\text{for all } n > r,\ x_{1:n}\in A^n.$$

We will always presume that our Markov chains are time homogeneous:

$$P(X_i = x_{r+1}|X_{i-r:i-1} = x_{1:r}) = P(x_{r+1}|x_{1:r}) \quad\text{for all } i > r,\ x_{1:r+1}\in A^{r+1}.$$

We denote by $\Theta^r$ the set of all probability measures that satisfy these conditions for
the given value of $r$ ($\Theta^0$ is the class of all i.i.d. processes). Note that $\Theta^r \subset \Theta^{r+1}$
for all $r$. The order of a Markov chain $P$ is the smallest $r \ge 0$ such that $P \in \Theta^r$.

Throughout the paper we fix a distinguished Markov chain $P^*$ of order $r^*$,
representing the true probability law of an observed process. We assume that $P^*$ is
stationary and irreducible. On the basis of a sequence of observations $x_{1:n}$ we
obtain an estimate $\hat r_n$ of the true order $r^*$ by maximizing the penalized likelihood

$$\hat r_n = \mathop{\mathrm{argmax}}_{0\le r<\kappa(n)} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:n}) - \mathrm{pen}(n,r) \Big\},$$

where $\mathrm{pen}(n,r)$ is a penalty function and $\kappa(n)$ is a cutoff function. If

$$\hat r_n \to r^* \quad\text{as } n\to\infty \quad P^*\text{-a.s.},$$

the estimator is called strongly consistent.
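
The penalized likelihood estimator can be sketched numerically. The following is a minimal illustration only, under stated assumptions: the supremum of the log-likelihood over chains of order at most r is computed from empirical transition frequencies (with unit mass on the observed initial path), the penalty is taken as C|A|^r log log n, and the cutoff defaults to the integer part of log n. The function names `max_log_likelihood` and `estimate_order`, the default constant `C=2.0` and the default cutoff are hypothetical choices made for this sketch and are not prescribed by the results of the paper. The sample must have length at least 3 so that log log n is positive.

```python
import math
from collections import Counter


def max_log_likelihood(x, r):
    """Maximized log-likelihood of the path x over chains of order <= r.

    Assumption of this sketch: the maximizer puts unit mass on the initial
    path x[:r] and uses empirical transition frequencies.
    """
    ctx_counts = Counter()    # occurrences of each length-r context
    trans_counts = Counter()  # occurrences of each (context, next symbol)
    for i in range(r, len(x)):
        ctx = tuple(x[i - r:i])
        ctx_counts[ctx] += 1
        trans_counts[(ctx, x[i])] += 1
    return sum(c * math.log(c / ctx_counts[ctx])
               for (ctx, _), c in trans_counts.items())


def estimate_order(x, alphabet_size, C=2.0, cutoff=None):
    """Penalized likelihood order estimate with a log log n penalty.

    C and the default cutoff int(log n) are illustrative, hypothetical values.
    """
    n = len(x)
    if cutoff is None:
        cutoff = max(1, int(math.log(n)))
    loglog = math.log(math.log(n))

    def score(r):
        return max_log_likelihood(x, r) - C * alphabet_size ** r * loglog

    # max() returns the first maximizer, i.e. the smallest order among ties
    return max(range(cutoff), key=score)
```

For example, `estimate_order([0, 1] * 100, alphabet_size=2)` evaluates the orders 0 through 4 on a strictly alternating binary path.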

Remark 2.1. As discussed in [[]], the assumption that P* is irreducible is 
necessary for the order estimation problem to be well posed, while stationarity of 
P* entails no loss of generality. In particular, the latter claim follows from the fact 
that any irreducible Markov chain P is absolutely continuous with respect to a 
stationary Markov chain Pg with the same transition probabilities, so that strong 
consistency under Pg automatically holds under P also. 
Define for any sequence $a_{1:r}\in A^r$ and $n \ge 1$ the random variable

$$N_n(a_{1:r}) = \sum_{i=r+1}^{n} \mathbf{1}_{\{x_{i-r:i-1} = a_{1:r}\}},$$

that is, $N_n(a_{1:r})$ is the number of times the sequence $a_{1:r}$ appears as a
subsequence of $x_{1:n-1}$. By the ergodic theorem, the approximation
$N_n(a_{1:r})/(n-r) \approx P^*(a_{1:r})$ holds for large $n$. The large-scale typicality property
essentially requires that this approximation holds uniformly for all $a_{1:r}$ with
$r < \rho(n)$. As in [§, ||], this idea plays an essential role in the proof of our main result.
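
As a small illustration of the counting variable, the sketch below tallies, for a given r, how often each word of length r occurs as a block of consecutive symbols in all but the last symbol of the path, so that the counts sum to n - r. The function name `context_counts` is a hypothetical helper introduced here for illustration.

```python
from collections import Counter


def context_counts(x, r):
    """Counts of each length-r word among the windows x[i-r:i], i = r..n-1.

    With 0-based indexing this matches counting occurrences in the first
    n-1 symbols of the path; there are exactly n - r windows.
    """
    n = len(x)
    return Counter(tuple(x[i - r:i]) for i in range(r, n))
```

Dividing each count by n - r gives the empirical frequency that the ergodic theorem compares with the stationary probability of the word.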

Definition 2.2. The process $P^*$ is said to satisfy the large-scale typicality
property with cutoff $\rho(n)$ if there exists a constant $\eta < 1$ such that

$$\left| \frac{1}{P^*(a_{1:r})} \frac{N_n(a_{1:r})}{n-r} - 1 \right| < \eta \quad\text{for all } a_{1:r}\in A^r \text{ with } P^*(a_{1:r}) > 0,\ r < \rho(n)$$

eventually as $n\to\infty$ $P^*$-a.s.
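
The typicality condition can be checked on a finite path for one value of r at a time. The sketch below computes the largest relative deviation between empirical word frequencies and their stationary probabilities. It assumes the stationary word probabilities are supplied by the caller through `word_prob`, a hypothetical callable, and the function name `typicality_deviation` is likewise an illustrative choice.

```python
from collections import Counter
from itertools import product


def typicality_deviation(x, r, word_prob, alphabet):
    """Largest |N_n(a)/((n - r) P*(a)) - 1| over words a of length r with P*(a) > 0.

    word_prob(word) is assumed to return the stationary probability of the
    word under the true law; it is an input of this sketch, not estimated.
    """
    n = len(x)
    counts = Counter(tuple(x[i - r:i]) for i in range(r, n))
    worst = 0.0
    for word in product(alphabet, repeat=r):
        p = word_prob(word)
        if p > 0:
            worst = max(worst, abs(counts[word] / ((n - r) * p) - 1.0))
    return worst
```

The property in the definition asks that this quantity stay below a fixed constant less than 1 simultaneously for all r below the cutoff, for all sufficiently large n.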

We are now ready to state the main result of this paper, which can be viewed
as a law of iterated logarithm for the likelihood ratio statistic. A similar result was
established in [[J, Lemma 3.4.1 for the case of a fixed order $r \ge r^*$. Our key
innovation is that here the result holds uniformly over the order $r^* \le r < \kappa(n)$,
where $\kappa(2n)$ is a cutoff for which the large-scale typicality property holds.

Theorem 2.3. Let $\kappa(n) < n/4$ be an increasing function, such that the process
$P^*$ satisfies the large-scale typicality property with cutoff $\kappa(2n)$. Then there
is a nonrandom constant $C_0 > 0$ (depending only on $\eta$) such that

$$\sup_{r^*\le r<\kappa(n)} \frac{1}{|A|^r} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:n}) - \sup_{P\in\Theta^{r^*}} \log P(x_{1:n}) \Big\} \le C_0 \log\log n$$

eventually as $n\to\infty$ $P^*$-a.s.

The following sections are devoted to the proof of this result. As a corollary, we 
obtain the following conclusion for the order estimation problem. 

Corollary 2.4. There exist constants $C^*$ and $\alpha^*$, where $\alpha^*$ depends on $P^*$,
such that any penalty and cutoff that satisfy eventually as $n\to\infty$

$$\mathrm{pen}(n,r) = |A|^r f(n) \log\log n, \qquad \kappa(n) \le \alpha^*\log n,$$

where $\kappa(n) \nearrow \infty$ and the function $f(n)$ satisfies

$$\liminf_{n\to\infty} f(n) > C^*, \qquad \lim_{n\to\infty} \frac{f(n)\log\log n}{n} = 0,$$

yield a strongly consistent Markov order estimator.
Proof. First, it is easy to see ([[]], Proposition A.1) that $P^*$-a.s.

$$\limsup_{n\to\infty} \frac{1}{n} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:n}) - \sup_{P\in\Theta^{r^*}} \log P(x_{1:n}) \Big\} < -C$$

for some constant $C > 0$ and all $r < r^*$. As $\mathrm{pen}(n,r)/n \to 0$ as $n\to\infty$, this
implies that $P^*$-a.s. we have eventually as $n\to\infty$

$$\sup_{P\in\Theta^r} \log P(x_{1:n}) - \mathrm{pen}(n,r) < \sup_{P\in\Theta^{r^*}} \log P(x_{1:n}) - \mathrm{pen}(n,r^*) \quad \forall\, r < r^*.$$

As $\kappa(n) > r^*$ for $n$ sufficiently large, this shows that $\liminf_{n\to\infty} \hat r_n \ge r^*$ $P^*$-a.s.

On the other hand, it is shown in [Q, Q] that the large-scale typicality property
holds with cutoff $\kappa(2n) \le \alpha^*\log 2n$ for some constant $\alpha^*$ which depends on $P^*$
(the constant $\eta$ in Definition 2.2 may be fixed arbitrarily). By Theorem 2.3,

$$\sup_{r^*<r<\kappa(n)} \frac{1}{\mathrm{pen}(n,r)} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:n}) - \sup_{P\in\Theta^{r^*}} \log P(x_{1:n}) \Big\} < \frac{1}{2|A|}$$

eventually as $n\to\infty$ $P^*$-a.s., provided $C^*$ is chosen sufficiently large. Note that

$$\frac{1}{\mathrm{pen}(n,r) - \mathrm{pen}(n,r^*)} = \frac{1}{\mathrm{pen}(n,r)} \frac{|A|^r}{|A|^r - |A|^{r^*}} \le \frac{1}{\mathrm{pen}(n,r)} \frac{|A|}{|A| - 1}$$

for all $r > r^*$, so we find that $P^*$-a.s. we have eventually as $n\to\infty$

$$\sup_{P\in\Theta^r} \log P(x_{1:n}) - \mathrm{pen}(n,r) < \sup_{P\in\Theta^{r^*}} \log P(x_{1:n}) - \mathrm{pen}(n,r^*)$$

for all $r^* < r < \kappa(n)$. Thus $\limsup_{n\to\infty} \hat r_n \le r^*$ $P^*$-a.s. □



Remark 2.5. The proofs of large-scale typicality in [Q, [^j] actually establish
a slightly stronger result, where the constant $\eta$ in Definition 2.2 is replaced by
$n^{-\beta}$ for some $\beta > 0$. This improvement is not needed for Theorem 2.3 to hold.



Remark 2.6. Theorem 2.3 states that the constant $C_0$ depends only on the
value of $\eta$ in Definition 2.2. Unfortunately, the constants obtained by our method
of proof are expected to be far from optimal; one can read off a value for $C_0$ of
order 10^ in the proof of Theorem 2.3, which is likely excessively large. 

Remark 2.7. It is not difficult to establish that there is a constant $C$ such that

$$\frac{1}{n} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:n}) - \sup_{P\in\Theta^{r^*}} \log P(x_{1:n}) \Big\} \le C$$

for all $n$ and $r$. It follows that

$$\sup_{r\ge(\log|A|)^{-1}\log n} \frac{1}{\mathrm{pen}(n,r)} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:n}) - \sup_{P\in\Theta^{r^*}} \log P(x_{1:n}) \Big\} < \frac{|A|-1}{|A|}$$

eventually as $n\to\infty$. In order to obtain a version of Corollary 2.4 with $\kappa(n) = \infty$,
the key difficulty is therefore to deal with orders in the range $\alpha^*\log n \le r <
(\log|A|)^{-1}\log n$. It is an open question whether it is possible to close this gap.



3. Reduction to a deviation bound. The proof of Theorem 2.3 consists of
two steps. In this section, we will prove the result assuming that the likelihood
ratio statistic satisfies a certain deviation bound. The requisite deviation bound,
which is stated in the following Proposition, will be proved in the next section.



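PROPOSITION 3.1. Define $F_n = G_n \cap G_{2n}$, where $G_n$ denotes the event

$$\left| \frac{1}{P^*(a_{1:r})} \frac{N_n(a_{1:r})}{n-r} - 1 \right| < \eta \quad\text{for all } a_{1:r}\in A^r \text{ with } P^*(a_{1:r}) > 0,\ r < \rho(n)$$

with $\rho(n)$ increasing and $\rho(n) < n/2$. Then there exist constants $C_1, C_1', C_2 > 0$,
which can be chosen to depend only on $\eta$, such that

$$P^*\Big[ F_n \cap \Big\{ \max_{i=n,\ldots,2n} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:i}) - \log P^*(x_{1:i}|x_{1:r}) \Big\} \ge \varepsilon \Big\} \Big] \le C_1' e^{-\varepsilon/C_1}$$

for all $n \ge 1$, $r^* \le r < \rho(n)$, and $\varepsilon \ge C_2|A|^r$.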



Conceptually, this result can be understood as follows. It is well known in
classical statistics that, in "regular" cases, the likelihood ratio statistic

$$\sup_{P\in\Theta^r} \log P(x_{1:n}) - \log P^*(x_{1:n})$$

converges weakly as $n\to\infty$ to a $\chi^2$-distributed random variable. Therefore, we
expect the likelihood ratio statistic to possess exponential tails at least for large $n$.
Proposition 3.1 provides a precise nonasymptotic description of this phenomenon.

We now prove Theorem 2.3 presuming that Proposition 3.1 holds.



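Proof of Theorem 2.3. We clearly need only consider sequences $x_{1:n}$ with
$P^*(x_{1:n}) > 0$. We begin with some straightforward estimates:

$$\sup_{r^*\le r<\kappa(n)} \frac{1}{|A|^r} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:n}) - \sup_{P\in\Theta^{r^*}} \log P(x_{1:n}) \Big\}$$
$$\le \sup_{r^*\le r<\kappa(n)} \frac{1}{|A|^r} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:n}) - \log P^*(x_{1:n}) \Big\}$$
$$= \sup_{r^*\le r<\kappa(n)} \frac{1}{|A|^r} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:n}) - \log P^*(x_{1:n}|x_{1:r}) - \log P^*(x_{1:r}) \Big\}$$
$$\le \sup_{r^*\le r<\kappa(n)} \frac{1}{|A|^r} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:n}) - \log P^*(x_{1:n}|x_{1:r}) \Big\} + C,$$

for a constant $C$ independent of $n$ and $x_{1:n}$. Here we have used that for any
irreducible (and time homogeneous) Markov chain $P^*$, there exists a constant
$0 < \lambda < 1$ such that $P^*(x_{1:r}) \ge \lambda^r$ whenever $P^*(x_{1:r}) > 0$, so that

$$\sup_{r\ge 0} \frac{-\log P^*(x_{1:r})}{|A|^r} \le C := \log(1/\lambda) \sup_{r\ge 0} \frac{r}{|A|^r} < \infty.$$

We conclude that it suffices to prove

$$\sup_{r^*\le r<\kappa(n)} \frac{1}{|A|^r} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:n}) - \log P^*(x_{1:n}|x_{1:r}) \Big\} \le C_0 \log\log n$$

eventually as $n\to\infty$ $P^*$-a.s. Define for simplicity

$$\Delta_{i,r} = \sup_{P\in\Theta^r} \log P(x_{1:i}) - \log P^*(x_{1:i}|x_{1:r}).$$

We can estimate

$$P^*\Big[ F_{2^n} \cap \Big\{ \max_{2^n\le i\le 2^{n+1}} \frac{1}{\log\log i} \sup_{r^*\le r<\kappa(i)} \frac{\Delta_{i,r}}{|A|^r} \ge C_0 \Big\} \Big]$$
$$\le P^*\Big[ F_{2^n} \cap \Big\{ \max_{2^n\le i\le 2^{n+1}} \sup_{r^*\le r<\kappa(2^{n+1})} \frac{\Delta_{i,r}}{|A|^r} \ge C_0 \log\log 2^n \Big\} \Big]$$
$$\le \sum_{r^*\le r<\kappa(2^{n+1})} P^*\Big[ F_{2^n} \cap \Big\{ \max_{2^n\le i\le 2^{n+1}} \Delta_{i,r} \ge C_0 |A|^r \log\log 2^n \Big\} \Big],$$

where we used that $\kappa(n)$ is increasing. Now let $F_n$ be defined as in Proposition 3.1
for $\rho(n) = \kappa(2n)$. Then there exist $C_1, C_1'$ such that for all $n$ sufficiently large,

$$P^*\Big[ F_{2^n} \cap \Big\{ \max_{2^n\le i\le 2^{n+1}} \Delta_{i,r} \ge C_0 |A|^r \log\log 2^n \Big\} \Big] \le C_1' e^{-C_0|A|^r \log\log 2^n / C_1}$$

for all $r^* \le r < \kappa(2^{n+1})$. Therefore

$$P^*\Big[ F_{2^n} \cap \Big\{ \max_{2^n\le i\le 2^{n+1}} \frac{1}{\log\log i} \sup_{r^*\le r<\kappa(i)} \frac{\Delta_{i,r}}{|A|^r} \ge C_0 \Big\} \Big] \le C_1' \sum_{r^*\le r<\kappa(2^{n+1})} \Big\{ e^{-C_0\log\log 2/C_1}\, n^{-C_0/C_1} \Big\}^{|A|^r}$$

for $n$ sufficiently large. Thus for any choice of $C_0 > C_1$, we find that

$$\sum_{n=1}^{\infty} P^*\Big[ F_{2^n} \cap \Big\{ \max_{2^n\le i\le 2^{n+1}} \frac{1}{\log\log i} \sup_{r^*\le r<\kappa(i)} \frac{\Delta_{i,r}}{|A|^r} \ge C_0 \Big\} \Big] < \infty.$$

By the Borel-Cantelli lemma,

$$\mathbf{1}_{F_{2^n}} \max_{2^n\le i\le 2^{n+1}} \frac{1}{\log\log i} \sup_{r^*\le r<\kappa(i)} \frac{\Delta_{i,r}}{|A|^r} < C_0 \quad\text{eventually as } n\to\infty \quad P^*\text{-a.s.}$$

But by large-scale typicality with cutoff $\kappa(2n)$, we know that $F_{2^n}$ must hold
eventually as $n\to\infty$ $P^*$-a.s. The result follows immediately. □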



Remark 3.2. The proof of Theorem 2.3 shows that the large-scale typicality
property is in fact only needed along an exponentially increasing subsequence of
times $t_n = 2^n$, so that the assumption of the Theorem can be weakened slightly.
However, the weaker assumption does not ultimately appear to lead to better results
than the full large-scale typicality assumption (for example, note that the proof of
large-scale typicality in [H] already utilizes such a subsequence).

Remark 3.3. Theorem 2.3 could be improved by employing the blocking
procedure along the subsequence $t_n = \gamma^n$ for arbitrary $\gamma > 1$. In this manner,
one can establish that the result is still valid under the weaker assumption that the
large-scale typicality property holds with cutoff $\kappa(\gamma n)$ for some $\gamma > 1$. However,
this does not appear to lead to a substantially different conclusion for the order
estimation problem. In order to keep the notation and proofs as transparent as possible
we have restricted our results to the case $\gamma = 2$, but the necessary modifications
for the case of arbitrary $\gamma > 1$ are easily implemented.



4. Proof of Proposition 3.1. The longest part of the proof of Theorem 2.3
consists of the proof of Proposition 3.1. To establish this result, we adapt an
approach using techniques from empirical process theory [Q []] that was originally
developed to obtain rates of convergence for nonparametric maximum likelihood
estimators in the i.i.d. setting. At the heart of the proof of Proposition 3.1 lies an
extension of a maximal inequality for families of martingales under bracketing
entropy conditions, due to van de Geer [[J, Theorem 8.13. The extension of this result
that is needed for our purposes is developed in the Appendix.

4.1. Preliminary computations. Any measure $P\in\Theta^r$ is uniquely determined
by its initial probability $P(x_{1:r})$ and its transition probability $P(x_{r+1}|x_{1:r})$. It is
easily seen that the measure which maximizes the log-likelihood $\log P(x_{1:n})$ of
$P\in\Theta^r$ assigns unit probability to the observed initial path $x_{1:r}$. Thus for $r \ge r^*$

$$\sup_{P\in\Theta^r} \log P(x_{1:n}) - \log P^*(x_{1:n}|x_{1:r}) = \sup_{P\in\Theta^r} \sum_{i=r+1}^{n} \log\left( \frac{P(x_i|x_{i-r:i-1})}{P^*(x_i|x_{i-r:i-1})} \right).$$

The family of functions $\log(P(x_i|x_{i-r:i-1})/P^*(x_i|x_{i-r:i-1}))$ ($P\in\Theta^r$) is $P^*$-a.s.
uniformly bounded from above but not from below. To avoid problems later on, we
apply a standard trick. For any $P\in\Theta^r$, define

$$\bar P(x_i|x_{i-r:i-1}) = \frac{P(x_i|x_{i-r:i-1}) + P^*(x_i|x_{i-r:i-1})}{2}.$$

Thus $\bar P$ is a Markov chain whose transition probabilities are an equal mixture of
the transition probabilities of $P$ and $P^*$ (the initial probabilities of $\bar P$ are irrelevant
for our purposes and need not be defined). By concavity of the logarithm, we find

$$\sup_{P\in\Theta^r} \log P(x_{1:n}) - \log P^*(x_{1:n}|x_{1:r}) \le 2 \sup_{P\in\Theta^r} \sum_{i=r+1}^{n} \log\left( \frac{\bar P(x_i|x_{i-r:i-1})}{P^*(x_i|x_{i-r:i-1})} \right).$$

It therefore suffices to obtain a deviation bound for the right hand side of this
expression, whose summands are $P^*$-a.s. uniformly bounded above and below.
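
The effect of replacing a transition probability by its equal mixture with the true one can be checked numerically for individual probabilities. The sketch below verifies, for a single pair of values, that the log-ratio of the original probability is at most twice the log-ratio of the mixture (the concavity step), and that the mixture log-ratio is never below -log 2 (the lower bound that the unmixed ratio lacks). The function names are hypothetical and the check is an illustration, not part of the proof.

```python
import math


def mixture_log_ratio(p, p_star):
    """log of ((p + p_star) / 2) / p_star, the summand after mixing."""
    return math.log((p + p_star) / (2.0 * p_star))


def mixture_bounds_hold(p, p_star, tol=1e-12):
    """Check log(p/p*) <= 2 log(pbar/p*) and log(pbar/p*) >= -log 2.

    p may be 0 (the unmixed log-ratio is then -infinity); p_star must be > 0.
    """
    unmixed = math.log(p / p_star) if p > 0 else -math.inf
    mixed = mixture_log_ratio(p, p_star)
    return unmixed <= 2.0 * mixed + tol and mixed >= -math.log(2.0) - tol
```

The first inequality is the arithmetic-geometric mean inequality in disguise: with u = p/p*, it reads sqrt(u) <= (u + 1)/2.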



4.2. Peeling. The first part of the proof of Proposition 3.1 aims to reduce the
problem to a deviation inequality for martingales. To this end we employ a peeling
device from the theory of weighted empirical processes.

Define the natural filtration $\mathcal{F}_n = \sigma\{X_1,\ldots,X_n\}$. For any $P\in\Theta^r$, we define

$$M_n^P = \sum_{i=r+1}^{n} \left\{ \log\left( \frac{\bar P(x_i|x_{i-r:i-1})}{P^*(x_i|x_{i-r:i-1})} \right) - E^*\left[ \log\left( \frac{\bar P(X_i|X_{i-r:i-1})}{P^*(X_i|X_{i-r:i-1})} \right) \Big| \mathcal{F}_{i-1} \right] \right\},$$

which is a martingale (under $P^*$) by construction. It is easily seen that

$$\sum_{i=r+1}^{n} \log\left( \frac{\bar P(x_i|x_{i-r:i-1})}{P^*(x_i|x_{i-r:i-1})} \right) = M_n^P - D_n^P,$$



MARKOV ORDER ESTIMATION 






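where we have defined

$$D_n^P = -\sum_{i=r+1}^{n} \sum_{a_i\in A} P^*(a_i|x_{i-r:i-1}) \log\left( \frac{\bar P(a_i|x_{i-r:i-1})}{P^*(a_i|x_{i-r:i-1})} \right).$$

We also define for any $P, P'\in\Theta^r$ the quantity

$$H_n(P,P') = \sum_{i=r+1}^{n} \sum_{a_i\in A} \Big( \bar P(a_i|x_{i-r:i-1})^{1/2} - \bar P'(a_i|x_{i-r:i-1})^{1/2} \Big)^2.$$

Note that $\sqrt{H_n(P,P')}$ defines a random distance on $\Theta^r$. As we will see below, the
role of the set $F_n$ (and hence the large-scale typicality assumption) in the proof of
Proposition 3.1 is that it allows us to control this random distance.

Lemma 4.1. For any $\varepsilon > 0$, $n \ge 1$ and $r \ge r^*$

$$P^*\Big[ F_n \cap \Big\{ \max_{i=n,\ldots,2n} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:i}) - \log P^*(x_{1:i}|x_{1:r}) \Big\} \ge \varepsilon \Big\} \Big]$$
$$\le \sum_{k=0}^{\infty} P^*\Big[ F_n \cap \Big\{ \sup_{P\in\Theta^r} \mathbf{1}_{H_n(P,P^*)\le 2^k\varepsilon} \max_{i=n,\ldots,2n} M_i^P \ge 2^{k-1}\varepsilon \Big\} \Big].$$

Proof. From the discussion above, it is clear that

$$P^*\Big[ F_n \cap \Big\{ \max_{i=n,\ldots,2n} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:i}) - \log P^*(x_{1:i}|x_{1:r}) \Big\} \ge \varepsilon \Big\} \Big]$$
$$\le P^*\Big[ F_n \cap \Big\{ \max_{i=n,\ldots,2n} \sup_{P\in\Theta^r} \sum_{\ell=r+1}^{i} \log\left( \frac{\bar P(x_\ell|x_{\ell-r:\ell-1})}{P^*(x_\ell|x_{\ell-r:\ell-1})} \right) \ge \frac{\varepsilon}{2} \Big\} \Big]$$
$$= P^*\Big[ F_n \cap \Big\{ \max_{i=n,\ldots,2n} \sup_{P\in\Theta^r} \big( M_i^P - D_i^P \big) \ge \frac{\varepsilon}{2} \Big\} \Big].$$

Now note that as $-\log x \ge 2 - 2\sqrt{x}$ for $x > 0$,

$$D_i^P \ge H_i(P,P^*).$$

Therefore, we can estimate

$$P^*\Big[ F_n \cap \Big\{ \max_{i=n,\ldots,2n} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:i}) - \log P^*(x_{1:i}|x_{1:r}) \Big\} \ge \varepsilon \Big\} \Big]$$
$$\le P^*\Big[ F_n \cap \Big\{ \max_{i=n,\ldots,2n} \sup_{P\in\Theta^r} \big( M_i^P - H_i(P,P^*) \big) \ge \frac{\varepsilon}{2} \Big\} \Big]$$
$$\le P^*\Big[ F_n \cap \Big\{ \sup_{P\in\Theta^r} \Big( \max_{i=n,\ldots,2n} M_i^P - H_n(P,P^*) \Big) \ge \frac{\varepsilon}{2} \Big\} \Big],$$

where the last step uses that $H_i(P,P^*)$ is nondecreasing in $i$.
We now partition the space $\Theta^r$ into an inner ring $\{P\in\Theta^r : H_n(P,P^*) \le \varepsilon\}$
and a collection of concentric rings $\{P\in\Theta^r : 2^{k-1}\varepsilon \le H_n(P,P^*) \le 2^k\varepsilon\}$, $k \ge 1$ (note
that this is a random partition, as the quantity $H_n(P,P')$ depends on the observed
path). Applying the union bound gives the estimates

$$P^*\Big[ F_n \cap \Big\{ \max_{i=n,\ldots,2n} \Big\{ \sup_{P\in\Theta^r} \log P(x_{1:i}) - \log P^*(x_{1:i}|x_{1:r}) \Big\} \ge \varepsilon \Big\} \Big]$$
$$\le P^*\Big[ F_n \cap \Big\{ \sup_{P\in\Theta^r} \Big( \max_{i=n,\ldots,2n} M_i^P - H_n(P,P^*) \Big) \mathbf{1}_{H_n(P,P^*)\le\varepsilon} \ge \frac{\varepsilon}{2} \Big\} \Big]$$
$$\quad + \sum_{k=1}^{\infty} P^*\Big[ F_n \cap \Big\{ \sup_{P\in\Theta^r} \Big( \max_{i=n,\ldots,2n} M_i^P - H_n(P,P^*) \Big) \mathbf{1}_{2^{k-1}\varepsilon\le H_n(P,P^*)\le 2^k\varepsilon} \ge \frac{\varepsilon}{2} \Big\} \Big]$$
$$\le \sum_{k=0}^{\infty} P^*\Big[ F_n \cap \Big\{ \sup_{P\in\Theta^r} \mathbf{1}_{H_n(P,P^*)\le 2^k\varepsilon} \max_{i=n,\ldots,2n} M_i^P \ge 2^{k-1}\varepsilon \Big\} \Big].$$

The proof is complete. □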
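
The elementary inequality -log x >= 2 - 2 sqrt(x) used in the proof above compares a Kullback-Leibler type quantity with a squared Hellinger type distance. The sketch below checks this comparison for a single conditional distribution: given a candidate transition law `p` and the true law `p_star` on a finite alphabet (both hypothetical inputs given as lists of probabilities), it evaluates one summand of each kind, using the equal mixture of `p` and `p_star` in place of `p`. The function names are illustrative.

```python
import math


def mixture(p, p_star):
    """Equal mixture of two probability vectors on the same alphabet."""
    return [(a + b) / 2.0 for a, b in zip(p, p_star)]


def kl_type_term(p, p_star):
    """-sum_a p*(a) log(pbar(a)/p*(a)), summed over letters with p*(a) > 0."""
    pbar = mixture(p, p_star)
    return -sum(q * math.log(m / q) for m, q in zip(pbar, p_star) if q > 0)


def hellinger_term(p, p_star):
    """sum_a (sqrt(pbar(a)) - sqrt(p*(a)))^2 over the whole alphabet."""
    pbar = mixture(p, p_star)
    return sum((math.sqrt(m) - math.sqrt(q)) ** 2 for m, q in zip(pbar, p_star))
```

Since both vectors sum to one, the Hellinger type term equals 2 minus twice the sum of sqrt(pbar(a) p*(a)), which is exactly what the inequality produces when applied letter by letter with weights p*(a).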



4.3. Control of $H_n$. Our next task is to control the quantity $H_n(P,P')$. First,
we show that on the event $F_n$ the quantity $H_n$ is comparable to

$$H(P,P') = \sum_{a_{1:r+1}\in A^{r+1}} P^*(a_{1:r}) \Big( \bar P(a_{r+1}|a_{1:r})^{1/2} - \bar P'(a_{r+1}|a_{1:r})^{1/2} \Big)^2,$$

which is a nonrandom squared distance on $\Theta^r$.

Lemma 4.2. There exist constants $C_3, C_4$ such that for any $n \ge 1$, we have

$$H_{2n}(P,P') \le C_3 H_n(P,P')$$

and

$$(n-r)\, C_4^{-1} H(P,P') \le H_n(P,P') \le (n-r)\, C_4\, H(P,P')$$

for all $P, P'\in\Theta^r$ and $r^* \le r < \rho(n)$ on the event $F_n$.

Proof. It is easily seen that for any $n \ge 1$

$$H_n(P,P') = \sum_{a_{1:r+1}\in A^{r+1}} N_n(a_{1:r}) \Big( \bar P(a_{r+1}|a_{1:r})^{1/2} - \bar P'(a_{r+1}|a_{1:r})^{1/2} \Big)^2.$$

On the event $F_n$, we have by construction

$$(1-\eta)\, P^*(a_{1:r}) \le \frac{N_n(a_{1:r})}{n-r} \le (1+\eta)\, P^*(a_{1:r})$$

and

$$(1-\eta)\, P^*(a_{1:r}) \le \frac{N_{2n}(a_{1:r})}{2n-r} \le (1+\eta)\, P^*(a_{1:r})$$

for all $a_{1:r}\in A^r$ and $r < \rho(n)$. Here we have used that $\rho(n) \le \rho(2n)$ as $\rho(n)$ is
presumed to be increasing. In particular, we have

$$N_{2n}(a_{1:r}) \le \frac{1+\eta}{1-\eta}\, \frac{2n-r}{n-r}\, N_n(a_{1:r}) \le 4\, \frac{1+\eta}{1-\eta}\, N_n(a_{1:r}),$$

where we have used that $n - r \ge n/2$ as $r < \rho(n) \le n/2$. The result follows
directly provided we choose $C_3, C_4$ (depending only on $\eta$) sufficiently large. □

Next, we control the quantity $H_n(P,P^*)$ in terms of the "Bernstein norm"
needed in order to apply the results developed in the Appendix. As in the
Appendix, we define the function $\phi(x) = e^x - x - 1$.



Lemma 4.3. Define for any P G 6^, r > r* and n > 1 



j=r+l 



log 



Y'*{xi\x- 



i—r:i~l ) 



3'i-i 



Then < 8i7„(P, P*)/or any P G r > and n > 1. 



PROOF. Note that log(P(xi|a;i_^:i_i)/P'^(xi|xi_^:i_i)) > -log(2). By [0], 
Lemma 7.1, we have (j){\x\) < (e^ - 1)^ for any x > — log(2)/2. Therefore 



< 8 5: 

i=r+l 



(^•^i\^i—r:i—l 



,1/2 



^ ^ P*(aj|Xi_r:i-l) 

;=r+l aiSA 



' P(at|a:i-r:i-i)^/^ 
P*(a,|xi_,^i_i)i/2 



The result follows immediately. 

Together with Lemma [4.1[ , we obtain the following. 



□ 
Corollary 4.4. Define for any $\sigma > 0$ the ball

$$\Theta^r(\sigma) = \{P\in\Theta^r : H(P,P^*) \le \sigma\}.$$

Then for any $\varepsilon > 0$, $n \ge 1$ and $r^* \le r < \rho(n)$



Fn n max < sup logP(xi:i) - logP*(xi:i|xi:r-) > > £ 
i=n,...,2n Pg©"- I 



oo 
k=0 



P6e'-(C42''-e/(n-r)) 

The proof is straightforward and is therefore omitted. 

4.4. Control of the bracketing entropy. We have now reduced the proof of
Proposition 3.1 to the problem of estimating the summands in Corollary 4.4. We
aim to do this by applying Proposition A.2 in the Appendix with $\Theta \subset \Theta^r$,

$$\xi_i^P = \begin{cases} \log\big( \bar P(X_i|X_{i-r:i-1}) / P^*(X_i|X_{i-r:i-1}) \big) & \text{for } i > r, \\ 0 & \text{for } i \le r, \end{cases}$$

and $K = 2$. To this end, the main remaining difficulty is to estimate the bracketing
entropy of Definition A.1. This is our next order of business.

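Lemma 4.5. Given $c > 0$, there exists $C_5 > 0$ depending only on $c$ such that

$$\log \mathcal{N}(2n, \Theta^r(\sigma), F_n, 2, \delta) \le |A|^{r+1} \log\left( \frac{C_5\sqrt{(2n-r)\,\sigma}}{\delta} \right)$$

for all $n \ge 1$, $r^* \le r < \rho(n)$, $\sigma > 0$ and $0 < \delta \le c\sqrt{(2n-r)\,\sigma}$.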



Proof. Fix n > 1, r* < r < p{n), a > and < 5 < CyJ{2n - r) a 
throughout the proof. We begin by defining the family of functions 



Ti3 = {p: A 



r+l 



where /3 > is to be determined in due course. We claim that for any P G Q^, 
there exist A^, 7^ G T/j such that for all ai;r+i G A''+^ with P*(ai:r) > 



A^(ai:r+l) < P(ar+l|ai:r) < J^{ai;r+l] 



and 



P 

P*(ai:.)V2- 
Indeed, this follows immediately by setting



A^(ai:r+l) 
7^(ai:r.+l) 



)1/2J' 

/?-lP*(ai:,)^/2 

rr^p*(ai:.)V2p( 

ar+l|«l:r 



/3-iP'^(ai:,)^/2 
for all ai-r+i G A''+i with P*(ai:,.) > 0. Therefore P*-a.s. 



Af := log 



P* (Xi\Xi—r;i—l) 



<^f <log(li^i^iz!Zzll 

\ P*(^i|^j— r:i— l) 



for all P G G'', i > r (we set Af = Tf = for i < r), where we have defined 

{Xi\Xi_r:i-l) = {-y^ (Xi-r-.i) + P*(Xi |Xi_r:j-l)}/2 and {Xi\Xi-r:i-l) = 

{X^ {xi-r:i) + P*(xi|xj_,.:j_i)}/2. Moreover, we can estimate 



2n 



i=l 







2n 








<4EE 


[( 






i=l 





(^Xi\Xi—r:i—l)^^'^ / 



3", 



2n 



^8 E E i)'^' 

< 4 E ^2n(ai;.) (7^(ai;,,+i)i/2 _ A^(ai:,+i)i/2 

ai:r + ieA'- + l 

A''2n(ai:r) 



<4/?2 ^ 



ai:r + ieA' 



,.+1 P^(«l:' 



where we have used that (/>(x) < (e^ — 1)^/2 for x > and [[]], Lemma 4.2. As in 
the proof of Lemma we find that for any P G 0'' 



2n 



^Ee 

i=\ 



Tf -Af 



< 4C4(2n-r)|A|^+i/32 



on the event F„ (as r < by assumption). Therefore, if we choose 



/3 



V4C4(2n-r)|A| 



7+T' 



then {(Af , Tf )i<i<2n}pee' (<7) is a (2n, G'''(o-), F„, 2, 5) -bracketing set. To com- 
plete the proof we must estimate the cardinality of this set. 
We approach this problem through a well known geometric device. We can
represent any function from $A^{r+1}$ to $\mathbb{R}$ as a vector in $\mathbb{R}^{|A|^{r+1}}$ in the obvious fashion.
In particular, for any $p : A^{r+1} \to \mathbb{R}$, denote by $\iota[p]$ the representative in $\mathbb{R}^{|A|^{r+1}}$ of
the function $a_{1:r+1} \mapsto P^*(a_{1:r})^{1/2}\, p(a_{1:r+1})^{1/2}$. Then by [Q], Lemma 4.2



where B{x, h) denotes the Euclidean ball in rI''^!'^^^ with center x and radius h. On 
the other hand, we clearly have ^[T/j] = C RI^I"""^'. Define for any 

x, x' G rI''^!'^^^ with x' >- x the cube [x, x'] : = {x G RI^I"^' : X ^ X ^ x'}. Let 

Ep := {x E :[x,x + n B{xo, A^) / 0}, 

where 1 € rI''^!'^^^ denotes the vector all of whose entries are one. Then clearly 

i[e"(cj)] c 5(xo,4V^)nR^i;'^' c IJ [x,x + pi], 

and, in particular, it is easily established from our previous computations that 

3sf(2?7,, Q^{a),Fn, 2, 6) < Now suppose that x' G [x,x + /31] for some x G 
H/j. Then there is an x" G [x,x + pi] such that x" G B{xo, In particular, we 

have \\x'-B{xo,4:V^)\\oo < /?, and therefore \\x' - B{xo,A^)\\2 < | A|(''+i)/2^, 
for every x' G [x,x + f31], x G H^. We conclude that 

y [x,x + /31] C5(xo,4^+|A|(^'+i)/2/?). 
Therefore, we can estimate 

Z?'"''''^' = vol I U [x,x + /31]| < vol(5(xo,4V^+|A|('^+^)/2^ 

= (4V^+|A|(-+i)/2/3)|Ar+\ol(S(0,l)). 
But from [[]], p. 249 we have the estimate 

Substituting the expression for /3 and reaiTanging, we find that 



< 



where we have used that 5 < C\J (2n — r)(T. The proof is easily completed. □ 
4.5. End of the proof. To complete the proof of Proposition 3.1, it remains to
put together the results obtained above with Proposition A.2 in the Appendix.



Proof of Proposition 3.1. In the following, we will always apply Lemma 4.5
and Proposition A.2 with the same constants $c, c_0, c_1 > 0$. The appropriate
values of these constants will be determined below. We will also fix $n \ge 1$,
$r^* \le r < \rho(n)$ and $\varepsilon \ge C_2|A|^r$, with the constant $C_2$ to be determined.

To apply Corollary 4.4, we invoke Proposition A. 2 with K = 2, and 
R = C-il^^'^e (fixing /c > for the time being). We find that 



Fn n sup 

PGe'-(C42fe£/(n-r)) 



< 2 exp 



CsC^c, + 1) 



provided that Cg > C^ici + 1) and 



CO ^log3^{2n,Q^'{^),Fn,2,u)du < 2'-'e < ciC^2''+'e. 

To ensure that the second inequality holds, it suffices to choose ci = (SCs)"^, and 
the condition on co is satisfied by choosing cq = C\/ (SCa)"-^ + 1. To simplify the 
first inequality, choose c = ^JSC^/C/^. Then the variable u in the integral satisfies 



u < y C32'=+3e < c^J{2n - r)C42^e/{n - r 
so by Lemma ^ it suffices to ensure that 



2»-'£ > |A|l'-+''''2cy'(8C3)-i + 1 j 



\ 



log (i^ill^W*. 



where we have used that r < p{n) < n/2 implies {2n — r)/{n — r) < A. Defining 



\ 



log \ dv < 00, 



a simple change of variables shows that the above inequahty is equivalent to 



or, equivalently, 



2^e > 4C|C2((8C3)-1 + l)|Ar+^ 
But this is always satisfied if we choose C2 = 4C|C^((8C3)^^ + 1)|A|.



With these choices of c, cq, ci, C2, we have thus shown that by Corollary i.4 



Fn n max I sup logP(a;i:i) - log P*(xi:j|xi:,.) ) > e 

t=n,...,2n pg0r I 



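$$\le 2\sum_{k\ge 0} \exp\left(-\frac{2^k\varepsilon}{2^5C^2(C_3 + 1/8)}\right) \le C_1'\exp\left(-\frac{\varepsilon}{C_1}\right)$$

with

$$C_1 = 2^5C^2(C_3 + 1/8), \qquad C_1' = \frac{2}{1 - e^{-C_2/2^5C^2(C_3 + 1/8)}},$$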

where we have used $\varepsilon \ge C_2$. This completes the proof. □

APPENDIX A: A MAXIMAL INEQUALITY FOR MARTINGALES


The purpose of this Appendix is to obtain a deviation bound on the supremum 
of an uncountable family of martingales, extending a result of van de Geer [[|]. 

We work on a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_i\}_{i\ge 0}, \mathbf{P})$. We are given a parameter set $\Theta$ and a collection $(\xi_i^\theta)_{i\ge 1}$, $\theta \in \Theta$ of random variables such that $\xi_i^\theta$ is $\mathcal{F}_i$-measurable for all $i, \theta$. This setting will be presumed throughout the Appendix. In the following we will frequently use the function $\phi(x) = e^x - x - 1$.

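Definition A.1. Let $n \in \mathbb{N}$, $F \in \mathcal{F}$, $K > 0$ and $\delta > 0$ be given. A finite collection $\{(\Lambda_i^j, \Upsilon_i^j)_{1\le i\le n}\}_{j=1,\ldots,N}$ of random variables is called a $(n, \Theta, F, K, \delta)$-bracketing set if $\Lambda_i^j, \Upsilon_i^j$ are $\mathcal{F}_i$-measurable for all $i, j$, and for every $\theta \in \Theta$, there is a $1 \le j \le N$ (the map $\theta \mapsto j$ is nonrandom) such that $\mathbf{P}$-a.s.

$$\Lambda_i^j \le \xi_i^\theta \le \Upsilon_i^j \quad \text{for all } i = 1, \ldots, n,$$

and such that

$$2K^2\sum_{i=1}^n \mathbf{E}\left[\phi\left(\frac{|\Upsilon_i^j - \Lambda_i^j|}{K}\right)\,\middle|\,\mathcal{F}_{i-1}\right] \le \delta^2 \quad \text{on } F.$$

We denote as $\mathcal{N}(n, \Theta, F, K, \delta)$ the cardinality $N$ of the smallest $(n, \Theta, F, K, \delta)$-bracketing set ($\log \mathcal{N}(n, \Theta, F, K, \delta)$ is called the bracketing entropy).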

The following extends a result of van de Geer [Q], Theorem 8.13. 

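Proposition A.2. Fix $K > 0$, and define for all $i \ge 0$

$$M_i^\theta = \sum_{\ell=1}^i \{\xi_\ell^\theta - \mathbf{E}[\xi_\ell^\theta|\mathcal{F}_{\ell-1}]\}, \qquad R_i^\theta = 2K^2\sum_{\ell=1}^i \mathbf{E}\left[\phi\left(\frac{|\xi_\ell^\theta|}{K}\right)\,\middle|\,\mathcal{F}_{\ell-1}\right].$$

There is a universal constant $C > 0$ such that for any $n \in \mathbb{N}$, $R < \infty$ and $F \in \mathcal{F}$

$$\mathbf{P}\left[F \cap \left\{\sup_{\theta\in\Theta} \mathbf{1}_{R_n^\theta \le R} \max_{i\le n} M_i^\theta \ge a\right\}\right] \le 2\exp\left(-\frac{a^2}{C^2(c_1 + 1)R}\right)$$

for any $a, c_0, c_1 > 0$ such that $c_0^2 \ge C^2(c_1 + 1)$ and

$$c_0\int_0^{\sqrt{R}} \sqrt{\log \mathcal{N}(n, \Theta, F, K, u)}\,du \le a \le \frac{c_1R}{K}.$$

[For example, the choice $C = 100$ works.]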

Remark A. 3. Throughout, all uncountable suprema should be interpreted as 
essential suprema under the measure P. Thus measurability problems are avoided. 

For our purposes, the key improvement over Theorem 8.13 is that the bound in this result is given for $\max_{i\le n} M_i^\theta$ rather than $M_n^\theta$. This is essential in order to employ the blocking procedure in the proof of Theorem 2.3. Rather than repeat the proof of Theorem 8.13 here with the necessary modifications, we take the opportunity to obtain a more general result from which Proposition A.2 follows.¹



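Theorem A.4. Fix $K > 0$, and define for all $i \ge 0$

$$M_i^\theta = \sum_{\ell=1}^i \{\xi_\ell^\theta - \mathbf{E}[\xi_\ell^\theta|\mathcal{F}_{\ell-1}]\}, \qquad R_i^\theta = 2K^2\sum_{\ell=1}^i \mathbf{E}\left[\phi\left(\frac{|\xi_\ell^\theta|}{K}\right)\,\middle|\,\mathcal{F}_{\ell-1}\right].$$

Then we have for any $n \in \mathbb{N}$, $R < \infty$, and $x > 0$

$$\mathbf{P}\left[F \cap \left\{\sup_{\theta\in\Theta} \mathbf{1}_{R_n^\theta \le R} \max_{i\le n} M_i^\theta \ge 16\mathcal{H} + 32\sqrt{Rx} + 16Kx\right\}\right] \le 2e^{-x},$$

where we have written

$$\mathcal{H} = K\log \mathcal{N}(n, \Theta, F, K, \sqrt{R}) + 4\int_0^{\sqrt{R}} \sqrt{\log \mathcal{N}(n, \Theta, F, K, u)}\,du.$$

Before we proceed, let us prove Proposition A.2 using Theorem A.4.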



' A closer look at the proof of ||J|, Theorem 8. 13 reveals a few inconsistencies which are corrected 
here. For example, equation (A. 12) in ^ seems to presuppose that X > on an event A implies that 
P[^|S] > on A, which need not be the case. The bracketing condition given in [[]], Definition 8.1 
therefore seems too weak to give the desired result. Similarly, the version of Bernstein's inequality 
given as Tu, Lemma 8.9 does not appear to be the one used in the proof of Theorem 8.13. 
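Proof of Proposition A.2. Let $a = \sqrt{C^2(c_1 + 1)Rx}$ and assume that the given bounds on $a$ hold. Then we can estimate

$$x = \frac{a^2}{C^2(c_1 + 1)R} \le \frac{a}{C^2(c_1 + 1)R} \times \frac{c_1R}{K} \le \frac{a}{C^2K}.$$

On the other hand, as $\mathcal{N}(n, \Theta, F, K, \delta)$ is nonincreasing in $\delta$, we have

$$c_0\sqrt{R\log \mathcal{N}(n, \Theta, F, K, \sqrt{R})} \le c_0\int_0^{\sqrt{R}} \sqrt{\log \mathcal{N}(n, \Theta, F, K, u)}\,du \le a.$$

Applying Theorem A.4, we find that

$$\mathbf{P}\left[F \cap \left\{\sup_{\theta\in\Theta} \mathbf{1}_{R_n^\theta \le R} \max_{i\le n} M_i^\theta \ge \left\{\frac{16c_1}{c_0^2} + \frac{64}{c_0} + \frac{32}{\sqrt{C^2(c_1 + 1)}} + \frac{16}{C^2}\right\}a\right\}\right] \le 2\exp\left(-\frac{a^2}{C^2(c_1 + 1)R}\right).$$

But using $c_0^2 \ge C^2(c_1 + 1) \ge C^2$, we can estimate

$$\frac{16c_1}{c_0^2} + \frac{64}{c_0} + \frac{32}{\sqrt{C^2(c_1 + 1)}} + \frac{16}{C^2} \le \frac{32}{C^2} + \frac{96}{C} \le 1$$

for $C$ sufficiently large (e.g., $C = 100$). □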
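The final estimate in the proof of Proposition A.2 can be checked numerically. The following is a minimal sketch, not part of the original argument: the function names and the test grid for $c_1$ are illustrative assumptions. It evaluates the coefficient of $a$ at the smallest admissible $c_0 = C\sqrt{c_1 + 1}$ (the coefficient is decreasing in $c_0$) and compares it with the uniform bound $32/C^2 + 96/C$ for $C = 100$.

```python
import math

def threshold_coefficient(C, c0, c1):
    # Coefficient multiplying a in the threshold 16*H + 32*sqrt(R*x) + 16*K*x
    # after inserting the bounds derived in the proof of Proposition A.2.
    return 16 * c1 / c0**2 + 64 / c0 + 32 / math.sqrt(C**2 * (c1 + 1)) + 16 / C**2

def worst_coefficient(C, c1_grid):
    # The coefficient decreases in c0, so the smallest admissible value
    # c0 = C*sqrt(c1 + 1) is the worst case for each c1.
    return max(threshold_coefficient(C, C * math.sqrt(c1 + 1), c1) for c1 in c1_grid)

C = 100.0
c1_grid = [10.0**e for e in range(-6, 7)]  # hypothetical test grid
worst = worst_coefficient(C, c1_grid)
uniform_bound = 32 / C**2 + 96 / C         # the bound used in the proof
print(worst, uniform_bound)
```

On this grid the coefficient stays below the uniform bound, which is itself below 1, consistent with the remark that $C = 100$ works.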



The remainder of the Appendix is devoted to the proof of Theorem A.4. It should be emphasized that the approach taken here is entirely standard in empirical process theory: the notion of bracketing entropy for martingales and the proof of the requisite form of Bernstein's inequality follow van de Geer [Q], while the relatively transparent proof of Theorem A.4 closely follows the proof given by Massart [Q], Theorem 6.8 in the i.i.d. setting. The full proofs are given here for completeness. Note also that we have made no effort to optimize the constants in the proof (the constants are necessarily somewhat larger than those obtained in [Q] due to the presence of the additional maximum $\max_{i\le n} M_i^\theta$).

A.1. A variant of Bernstein's inequality. The following result is a variant of Bernstein's inequality for martingales. It slightly improves on [[]], Lemma 8.11 in that we do not assume that $\mathbf{E}[\xi_i|\mathcal{F}_{i-1}] = 0$ for all $i$ (though it appears that this version is implicitly used in the proof of Theorem 8.13).

Proposition A.5. Let $(\xi_i)_{i\ge 1}$ be a sequence of random variables such that $\xi_i$ is $\mathcal{F}_i$-measurable for all $i$, and define the martingale

$$M_j = \sum_{i=1}^j \{\xi_i - \mathbf{E}[\xi_i|\mathcal{F}_{i-1}]\} \quad \text{for all } j \ge 0.$$

Fix $K > 0$, and let $(Z_j)_{j\ge 0}$ be predictable (i.e., $Z_j$ is $\mathcal{F}_{j-1}$-measurable) such that

$$\sum_{i=1}^j \mathbf{E}[|\xi_i|^m|\mathcal{F}_{i-1}] \le m!\,K^mZ_j \quad \text{for all } m \ge 2,\ j \ge 0.$$

Then we have for all $a > 0$ and $Z > 0$

$$\mathbf{P}[M_j \ge a \text{ and } Z_j \le Z \text{ for some } j] \le \exp\left(-\frac{a^2}{2K(a + 2KZ)}\right).$$



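Proof. Given $\lambda > 0$ such that $\lambda^{-1} > K$, we define the process $(S_j)_{j\ge 0}$ as $S_j = e^{\lambda M_j - Z_j^\lambda}$, where $Z_j^\lambda = \sum_{i=1}^j \mathbf{E}[\phi(\lambda|\xi_i|)\,|\,\mathcal{F}_{i-1}]$. Using $1 + x \le e^x$, we find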



Sj-i 



3_ _ gA5j-i^i^(,j 



Now using the basic property $\phi(x) \le \phi(|x|)$ and $1 + x \le e^x$, we have



E 



5. 



< e-E[^«il^i-il {1 + E[Aei|:?i-i]} < 1. 
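Thus $S_j$ is a positive supermartingale. To proceed, define the stopping time

$$\tau = \min\{j : M_j \ge a \text{ and } Z_j \le Z\}.$$

Then $\{M_j \ge a \text{ and } Z_j \le Z \text{ for some } j\} = \{\tau < \infty\}$. Moreover, as $\lambda^{-1} > K$

$$Z_j^\lambda = \sum_{m=2}^\infty \sum_{i=1}^j \frac{\lambda^m}{m!}\mathbf{E}[|\xi_i|^m|\mathcal{F}_{i-1}] \le Z_j\sum_{m=2}^\infty (\lambda K)^m = Z_j\frac{\lambda^2K^2}{1 - \lambda K} \quad \text{for all } j.$$

Therefore $Z_\tau^\lambda \le \lambda^2K^2Z_\tau/(1 - \lambda K)$, and we can estimate

$$S_\tau = e^{\lambda M_\tau - Z_\tau^\lambda} \ge e^{\lambda a - \lambda^2K^2Z_\tau/(1 - \lambda K)} \ge e^{\lambda a - \lambda^2K^2Z/(1 - \lambda K)} \quad \text{on } \{\tau < \infty\}.$$

We obtain, using the supermartingale property,

$$\mathbf{P}[\tau < \infty] \le \mathbf{E}[\mathbf{1}_{\tau<\infty}\,e^{\lambda^2K^2Z/(1 - \lambda K) - \lambda a}S_\tau] \le e^{\lambda^2K^2Z/(1 - \lambda K) - \lambda a}.$$

The proof is completed by choosing $\lambda^{-1} = K + 2K^2Z/a$. □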
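The last step of the proof substitutes $\lambda^{-1} = K + 2K^2Z/a$ into the exponent $\lambda^2K^2Z/(1 - \lambda K) - \lambda a$. As a sanity check, the sketch below (function names and test values are illustrative assumptions, not part of the source) verifies numerically that this choice is admissible, i.e. $\lambda K < 1$, and that it reproduces the exponent $-a^2/(2K(a + 2KZ))$ stated in Proposition A.5.

```python
import math

def supermartingale_exponent(lam, a, K, Z):
    # Exponent obtained from the supermartingale argument for a given lambda.
    return lam**2 * K**2 * Z / (1 - lam * K) - lam * a

def bernstein_exponent(a, K, Z):
    # Exponent appearing in the statement of Proposition A.5.
    return -a**2 / (2 * K * (a + 2 * K * Z))

def chosen_lambda(a, K, Z):
    # The choice made at the end of the proof: 1/lambda = K + 2 K^2 Z / a.
    return 1.0 / (K + 2 * K**2 * Z / a)

# Hypothetical test values for (a, K, Z).
cases = [(1.0, 1.0, 1.0), (0.3, 2.0, 5.0), (10.0, 0.5, 0.01), (7.0, 3.0, 2.5)]
for a, K, Z in cases:
    lam = chosen_lambda(a, K, Z)
    print(lam * K, supermartingale_exponent(lam, a, K, Z), bernstein_exponent(a, K, Z))
```

The two exponents agree up to rounding in every case, as the algebra predicts.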
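Corollary A.6. Let $(\xi_i)_{1\le i\le n}$ be a sequence of random variables such that $\xi_i$ is $\mathcal{F}_i$-measurable for all $i$, and fix $K > 0$. Define $(M_j)_{0\le j\le n}$ and $(R_j)_{0\le j\le n}$ as

$$M_j = \sum_{i=1}^j \{\xi_i - \mathbf{E}[\xi_i|\mathcal{F}_{i-1}]\}, \qquad R_j = 2K^2\sum_{i=1}^j \mathbf{E}\left[\phi\left(\frac{|\xi_i|}{K}\right)\,\middle|\,\mathcal{F}_{i-1}\right].$$

Then we have for all $a > 0$ and $R > 0$

$$\mathbf{P}\left[\max_{j\le n} M_j \ge a \text{ and } R_n \le R\right] \le \exp\left(-\frac{a^2}{2(Ka + R)}\right).$$

If in addition $\|\xi_i\|_\infty \le 3U$ for all $i$, then for all $a > 0$ and $R > 0$

$$\mathbf{P}\left[\max_{j\le n} M_j \ge a \text{ and } R_n \le R\right] \le \exp\left(-\frac{a^2}{2(Ua + R)}\right).$$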

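Proof. To obtain the first inequality, note that for any $m \ge 2$ and $j \ge 0$

$$\frac{1}{m!\,K^m}\sum_{i=1}^j \mathbf{E}[|\xi_i|^m|\mathcal{F}_{i-1}] \le \sum_{\ell=2}^\infty \sum_{i=1}^j \frac{1}{\ell!\,K^\ell}\mathbf{E}[|\xi_i|^\ell|\mathcal{F}_{i-1}] = \frac{R_j}{2K^2}.$$

We can therefore apply Proposition A.5 with $Z_j = R_j/2K^2$. For the second inequality, note that $\|\xi_i\|_\infty \le 3U$ implies that for all $m \ge 2$ and $j \ge 0$

$$\sum_{i=1}^j \mathbf{E}[|\xi_i|^m|\mathcal{F}_{i-1}] \le (3U)^{m-2}\sum_{i=1}^j \mathbf{E}[|\xi_i|^2|\mathcal{F}_{i-1}] \le (3U)^{m-2}R_j \le m!\,U^m\frac{R_j}{2U^2},$$

where we used that $m! \ge 2 \times 3^{m-2}$ for $m \ge 2$. We can therefore apply Proposition A.5 with $Z_j = R_j/2U^2$. It remains to use that $R_j$ is nondecreasing. □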
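The proof of Corollary A.6 relies on two elementary inequalities: $m! \ge 2 \times 3^{m-2}$ for $m \ge 2$, and $|x|^2 \le 2K^2\phi(|x|/K)$ with $\phi(x) = e^x - x - 1$ (which is what bounds the conditional second moments by $R_j$). The sketch below spot-checks both; the helper names and test values are illustrative assumptions.

```python
import math

def phi(x):
    # phi(x) = e^x - x - 1, computed via expm1 for accuracy near zero.
    return math.expm1(x) - x

def factorial_bound_holds(m):
    # m! >= 2 * 3^(m-2), used to pass from (3U)^(m-2) to m! U^(m-2) / 2.
    return math.factorial(m) >= 2 * 3 ** (m - 2)

def square_bound_holds(x, K, tol=1e-12):
    # |x|^2 <= 2 K^2 phi(|x| / K), a consequence of phi(y) >= y^2 / 2 for y >= 0.
    return x * x <= 2 * K * K * phi(abs(x) / K) + tol

print(all(factorial_bound_holds(m) for m in range(2, 30)))
print(all(square_bound_holds(x, K) for x in (-3.0, -0.5, 0.01, 0.5, 3.0) for K in (0.5, 1.0, 2.0)))
```

Equality holds in the factorial bound at $m = 2$ and $m = 3$, which is why the constant 3 appears in the assumption $\|\xi_i\|_\infty \le 3U$.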

A.2. Maximal inequalities for finite sets. The following result allows us to control finite families of random variables that satisfy a Bernstein-type deviation inequality. A sharper form of this result can be obtained using an estimate on the moment generating function of the random variables, see [^], Lemma 2.3, but we do not have such an estimate for the maximum $\max_{j\le n} M_j$. Throughout the remainder of the Appendix, we define $\mathbf{E}^A[X] = \mathbf{E}[\mathbf{1}_AX]/\mathbf{P}[A]$ for any event $A \in \mathcal{F}$.

Lemma A.7. Let $X_1, \ldots, X_N$ be random variables such that

$$\mathbf{P}[|X_i| \ge a] \le \exp\left(-\frac{a^2}{2(Ka + R)}\right) \quad \text{for all } 1 \le i \le N.$$

Then we have for any event $A \in \mathcal{F}$

$$\mathbf{E}^A\left[\max_{i=1,\ldots,N} |X_i|\right] \le \sqrt{8R\log\left(1 + \frac{N}{\mathbf{P}[A]}\right)} + 8K\log\left(1 + \frac{N}{\mathbf{P}[A]}\right).$$
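Proof. Let $\psi(x)$ be a Young function. Then

$$\psi\left(\frac{\mathbf{E}^A[\max_{i\le N}|X_i|]}{\max_{i\le N}\|X_i\|_\psi}\right) \le \mathbf{E}^A\left[\max_{i\le N}\psi\left(\frac{|X_i|}{\|X_i\|_\psi}\right)\right] \le \frac{1}{\mathbf{P}[A]}\sum_{i=1}^N \mathbf{E}\left[\psi\left(\frac{|X_i|}{\|X_i\|_\psi}\right)\right] \le \frac{N}{\mathbf{P}[A]},$$

where $\|X_i\|_\psi$ denotes the Orlicz norm. Therefore

$$\mathbf{E}^A\left[\max_{i\le N}|X_i|\right] \le \psi^{-1}\left(\frac{N}{\mathbf{P}[A]}\right)\max_{i\le N}\|X_i\|_\psi.$$

To proceed, note that for $1 \le i \le N$

$$\mathbf{P}[|X_i|\mathbf{1}_{|X_i|\le R/K} \ge a] = \mathbf{P}[R/K \ge |X_i| \ge a] \le \exp\left(-\frac{a^2}{4R}\right),$$

$$\mathbf{P}[|X_i|\mathbf{1}_{|X_i|>R/K} \ge a] = \mathbf{P}[|X_i| \ge a \vee R/K] \le \exp\left(-\frac{a}{4K}\right).$$

By [Q], Lemma 2.2.1, $\|X_i\mathbf{1}_{|X_i|\le R/K}\|_{\psi_2} \le \sqrt{8R}$ and $\|X_i\mathbf{1}_{|X_i|>R/K}\|_{\psi_1} \le 8K$ for all $i$, where $\psi_p(x) = e^{x^p} - 1$. The proof is easily completed. □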
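The two tail bounds in the proof of Lemma A.7 come from splitting the Bernstein exponent $a^2/(2(Ka + R))$ at $a = R/K$: below this level the exponent is at least $a^2/(4R)$ (sub-Gaussian regime), above it is at least $a/(4K)$ (sub-exponential regime). A minimal numerical sketch of this split, with illustrative function names and test values that are not from the source:

```python
def bernstein_rate(a, K, R):
    # Exponent in the Bernstein-type tail bound exp(-a^2 / (2 (K a + R))).
    return a * a / (2 * (K * a + R))

def split_bound_holds(a, K, R, tol=1e-12):
    # For a <= R/K we have K a <= R, so the rate is at least a^2 / (4 R);
    # for a >= R/K we have R <= K a, so the rate is at least a / (4 K).
    if a <= R / K:
        return bernstein_rate(a, K, R) >= a * a / (4 * R) - tol
    return bernstein_rate(a, K, R) >= a / (4 * K) - tol

# Hypothetical grid of levels a and parameters (K, R).
grid = [(a, K, R) for a in (0.01, 0.5, 1.5, 4.0, 50.0)
        for K in (0.5, 2.0) for R in (0.1, 3.0, 20.0)]
print(all(split_bound_holds(a, K, R) for a, K, R in grid))
```

At $a = R/K$ both lower bounds coincide with the Bernstein rate itself, so the split loses nothing at the crossover.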

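Corollary A.8. Let $(\xi_i^h)_{1\le i\le n}$, $h = 1, \ldots, N$ be random variables such that $\xi_i^h$ is $\mathcal{F}_i$-measurable for all $i, h$. Fix $K > 0$, and define

$$M_j^h = \sum_{i=1}^j \{\xi_i^h - \mathbf{E}[\xi_i^h|\mathcal{F}_{i-1}]\}, \qquad R_j^h = 2K^2\sum_{i=1}^j \mathbf{E}\left[\phi\left(\frac{|\xi_i^h|}{K}\right)\,\middle|\,\mathcal{F}_{i-1}\right].$$

Then we have

$$\mathbf{E}^A\left[\max_{h=1,\ldots,N} \mathbf{1}_{R_n^h\le R}\max_{j\le n} M_j^h\right] \le \sqrt{8R\log\left(1 + \frac{N}{\mathbf{P}[A]}\right)} + 8K\log\left(1 + \frac{N}{\mathbf{P}[A]}\right)$$

for any event $A \in \mathcal{F}$. If in addition $\|\xi_i^h\|_\infty \le 3U$ for all $i, h$, then

$$\mathbf{E}^A\left[\max_{h=1,\ldots,N} \mathbf{1}_{R_n^h\le R}\max_{j\le n} M_j^h\right] \le \sqrt{8R\log\left(1 + \frac{N}{\mathbf{P}[A]}\right)} + 8U\log\left(1 + \frac{N}{\mathbf{P}[A]}\right)$$

for any event $A \in \mathcal{F}$.

Proof. Apply the previous lemma with $X_h = \mathbf{1}_{R_n^h\le R}\max_{j\le n} M_j^h$. Note that as $M_0^h = 0$, certainly $X_h \ge 0$. Therefore $X_h = |X_h|$, and the requisite tail bounds are obtained immediately from Corollary A.6 above. □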






RAMON VAN HANDEL 



A.3. Proof of Theorem A.4. We now proceed to the proof of Theorem A.4. We follow closely the proof given by Massart [[]], Theorem 6.8 in the i.i.d. setting. The general approach, by means of a chaining device with bracketing with adaptive truncation, is standard in empirical process theory.

Before we proceed to the proof, let us define the function

$$\Phi(x) := 16\mathcal{H} + 32\sqrt{Rx} + 16Kx,$$

where $\mathcal{H}$ is as defined in Theorem A.4. We claim that in order to prove the Theorem, it actually suffices to prove the estimate

$$\mathbf{E}^A\left[\sup_{\theta\in\Theta} \mathbf{1}_{R_n^\theta\le R}\max_{i\le n} M_i^\theta\right] \le \Phi\left(\log\left(1 + \frac{1}{\mathbf{P}[A]}\right)\right)$$

for any event $A \subset F$. Indeed, if this is the case, then choosing

$$A = F \cap \left\{\sup_{\theta\in\Theta} \mathbf{1}_{R_n^\theta\le R}\max_{i\le n} M_i^\theta \ge \Phi(x)\right\}$$

allows us to estimate

$$\Phi(x) \le \mathbf{E}^A\left[\sup_{\theta\in\Theta} \mathbf{1}_{R_n^\theta\le R}\max_{i\le n} M_i^\theta\right] \le \Phi\left(\log\left(1 + \frac{1}{\mathbf{P}[A]}\right)\right),$$

from which the conclusion of Theorem A.4 is immediate. We therefore concentrate without loss of generality on obtaining the above estimate.
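The reduction above uses that $\Phi$ is increasing: $\Phi(x) \le \Phi(\log(1 + 1/\mathbf{P}[A]))$ forces $x \le \log(1 + 1/\mathbf{P}[A])$, hence $\mathbf{P}[A] \le 1/(e^x - 1)$, and combined with the trivial bound $\mathbf{P}[A] \le 1$ this gives $\mathbf{P}[A] \le 2e^{-x}$. The sketch below checks this last elementary step on a grid; the function name and grid are illustrative assumptions.

```python
import math

def implied_probability_bound(x):
    # Largest p in (0, 1] compatible with x <= log(1 + 1/p),
    # i.e. p <= 1 / (e^x - 1), capped at the trivial bound p <= 1.
    if x <= 0:
        return 1.0
    return min(1.0, 1.0 / math.expm1(x))

def conclusion_holds(x, tol=1e-12):
    # Theorem A.4 claims the probability is at most 2 e^{-x}.
    return implied_probability_bound(x) <= 2.0 * math.exp(-x) + tol

xs = [0.01 * k for k in range(1, 2001)]  # hypothetical grid on (0, 20]
print(all(conclusion_holds(x) for x in xs))
```

For $e^x < 2$ the claim is trivial since $2e^{-x} > 1$; for $e^x \ge 2$ it follows from $1/(e^x - 1) \le 2e^{-x}$.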



Proof of Theorem A.4. We fix $n \in \mathbb{N}$, $R < \infty$, $F \in \mathcal{F}$ and $A \subset F$ throughout the proof. Define $\delta_j = 2^{-j}\sqrt{R}$ and $N_j = \mathcal{N}(n, \Theta, F, K, \delta_j)$ for $j \ge 0$. We assume that $N_j < \infty$ for all $j$, otherwise there is nothing to prove. Therefore, for each $j$, we can choose a collection $\{(\Lambda_i^{j,p}, \Upsilon_i^{j,p})_{1\le i\le n}\}_{p=1,\ldots,N_j}$ that satisfies the conditions of Definition A.1, and these will remain fixed throughout the proof. In particular, for every $j, \theta$, there exists $p(j, \theta)$ such that

$$\Lambda_i^{j,p(j,\theta)} \le \xi_i^\theta \le \Upsilon_i^{j,p(j,\theta)} \quad \text{for all } i = 1, \ldots, n.$$



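For notational simplicity, we will write

$$\Pi_i^{j,\theta} = \Upsilon_i^{j,p(j,\theta)}, \qquad \Delta_i^{j,\theta} = \Upsilon_i^{j,p(j,\theta)} - \Lambda_i^{j,p(j,\theta)}.$$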



At the heart of the proof is a chaining device: we introduce the telescoping sum 



+ E{ni 



n: 
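where by convention $\Pi_i^{-1,\theta} = \Pi_i^{0,\theta}$. The length of the chain is chosen adaptively:

$$\tau_i^\theta = \min\{j \ge 0 : \Delta_i^{j,\theta} > a_j\} \wedge J.$$

The levels $a_j > 0$ and $J \ge 1$ will be determined later on (we will choose $a_j$ to control the second term in Corollary A.8, and we will ultimately let $J \to \infty$). It will be convenient to split the chain into three parts:

$$\xi_i^\theta = \Pi_i^{0,\theta} + \sum_{j=0}^J (\xi_i^\theta - \Pi_i^{j,\theta} \wedge \Pi_i^{j-1,\theta})\mathbf{1}_{\tau_i^\theta = j} \tag{A.1}$$

$$\qquad + \sum_{j=1}^J \left\{(\Pi_i^{j,\theta} \wedge \Pi_i^{j-1,\theta} - \Pi_i^{j-1,\theta})\mathbf{1}_{\tau_i^\theta = j} + (\Pi_i^{j,\theta} - \Pi_i^{j-1,\theta})\mathbf{1}_{\tau_i^\theta > j}\right\}. \tag{A.2}$$

Denote by $b_i^{j,\theta}$ the summands in (A.1), by $c_i^{j,\theta}$ the summands in (A.2), and define the martingales $A_i^\theta = \sum_{\ell=1}^i \{\Pi_\ell^{0,\theta} - \mathbf{E}[\Pi_\ell^{0,\theta}|\mathcal{F}_{\ell-1}]\}$, $B_i^{j,\theta} = \sum_{\ell=1}^i \{b_\ell^{j,\theta} - \mathbf{E}[b_\ell^{j,\theta}|\mathcal{F}_{\ell-1}]\}$ and $C_i^{j,\theta} = \sum_{\ell=1}^i \{c_\ell^{j,\theta} - \mathbf{E}[c_\ell^{j,\theta}|\mathcal{F}_{\ell-1}]\}$. We will control each martingale separately.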
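Control of $A^\theta$. As $\phi$ is convex and nondecreasing, and as $|\Pi_i^{0,\theta} - \xi_i^\theta| \le |\Delta_i^{0,\theta}|$,

$$\phi\left(\frac{|\Pi_i^{0,\theta}|}{2K}\right) \le \phi\left(\frac{|\Delta_i^{0,\theta}| + |\xi_i^\theta|}{2K}\right) \le \frac{1}{2}\phi\left(\frac{|\Delta_i^{0,\theta}|}{K}\right) + \frac{1}{2}\phi\left(\frac{|\xi_i^\theta|}{K}\right).$$

Using Definition A.1, we find that

$$2(2K)^2\sum_{\ell=1}^n \mathbf{E}\left[\phi\left(\frac{|\Pi_\ell^{0,\theta}|}{2K}\right)\,\middle|\,\mathcal{F}_{\ell-1}\right] \le 2(\delta_0^2 + R) = 4R \quad \text{on } \{R_n^\theta \le R\} \cap F.$$

Therefore

$$\mathbf{E}^A\left[\sup_{\theta\in\Theta} \mathbf{1}_{R_n^\theta\le R}\max_{i\le n} A_i^\theta\right] \le \sqrt{32R\log\left(1 + \frac{N_0}{\mathbf{P}[A]}\right)} + 16K\log\left(1 + \frac{N_0}{\mathbf{P}[A]}\right)$$

by Corollary A.8, where we have used that $A \subset F$.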
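Control of $B^{j,\theta}$. Note that $b_\ell^{j,\theta} \le 0$, so that

$$b_\ell^{j,\theta} - \mathbf{E}[b_\ell^{j,\theta}|\mathcal{F}_{\ell-1}] \le \mathbf{E}[(\Pi_\ell^{j,\theta} \wedge \Pi_\ell^{j-1,\theta} - \xi_\ell^\theta)\mathbf{1}_{\tau_\ell^\theta = j}|\mathcal{F}_{\ell-1}] \le \mathbf{E}[\Delta_\ell^{j,\theta}\mathbf{1}_{\tau_\ell^\theta = j}|\mathcal{F}_{\ell-1}].$$

Consider first the case that $j < J$. When $\tau_\ell^\theta = j$, we have $\Delta_\ell^{j,\theta} > a_j$. Thus

$$\mathbf{E}[\Delta_\ell^{j,\theta}\mathbf{1}_{\tau_\ell^\theta = j}|\mathcal{F}_{\ell-1}] \le \frac{1}{a_j}\mathbf{E}[|\Delta_\ell^{j,\theta}|^2|\mathcal{F}_{\ell-1}] \le \frac{2K^2}{a_j}\mathbf{E}\left[\phi\left(\frac{|\Delta_\ell^{j,\theta}|}{K}\right)\,\middle|\,\mathcal{F}_{\ell-1}\right],$$

where we have used $|x|^2 \le 2K^2\phi(|x|/K)$. In particular,

$$\max_{i\le n} B_i^{j,\theta} \le \frac{2K^2}{a_j}\sum_{\ell=1}^n \mathbf{E}\left[\phi\left(\frac{|\Delta_\ell^{j,\theta}|}{K}\right)\,\middle|\,\mathcal{F}_{\ell-1}\right] \le \frac{\delta_j^2}{a_j} \quad \text{on } F,$$

where we have applied Definition A.1. As $A \subset F$, it follows that

$$\mathbf{E}^A\left[\sup_{\theta\in\Theta} \mathbf{1}_{R_n^\theta\le R}\max_{i\le n} B_i^{j,\theta}\right] \le \frac{\delta_j^2}{a_j} \quad \text{for } j < J.$$

Now consider the case $j = J$. We can estimate

$$\max_{i\le n} B_i^{J,\theta} \le \sum_{\ell=1}^n \mathbf{E}[|\Delta_\ell^{J,\theta}|\,|\,\mathcal{F}_{\ell-1}] \le \sqrt{n}\left(\sum_{\ell=1}^n \mathbf{E}[|\Delta_\ell^{J,\theta}|^2|\mathcal{F}_{\ell-1}]\right)^{1/2} \le \delta_J\sqrt{n} \quad \text{on } F,$$

where we have applied the same computations as above. It follows that

$$\mathbf{E}^A\left[\sup_{\theta\in\Theta} \mathbf{1}_{R_n^\theta\le R}\max_{i\le n} B_i^{J,\theta}\right] \le \delta_J\sqrt{n},$$

where we have used that $A \subset F$.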



Control of C. As U'/ - nr"' = - Ce + - we have 



Therefore 



As A;^'^ < Oj whenever > j, we find that 

Moreover, as $|c_i^{j,\theta}| \le \Delta_i^{j-1,\theta} \vee \Delta_i^{j,\theta} \le \Delta_i^{j-1,\theta} + \Delta_i^{j,\theta}$, we obtain using that $\phi$ is convex and nondecreasing (in the same manner as above for the control of $A^\theta$)



i=l 



2K 



< 2(5? 1 + ^2) onF, 
where we have used Definition A.1. As $A \subset F$, we can therefore estimate



suplR9<i?maxC/'' 



sup ,.2 , r2-, maxCf 



Now note that c^'^ depends on only through the values of p(0, 6), . . . , p{j, 6). In 
particular, for fixed j, the supremum of 1^j:9<2(52 _^^2) maxj<„ C/' as 6* varies 

over is in fact only the maximum over a finite collection of random variables, 
whose cardinality is bounded above by the quantity 

$$\mathbf{N}_j := \prod_{p=0}^j N_p.$$

We therefore obtain the estimate 



suplije<RmaxC/ 



i<n 



where we have applied Corollary A.8.

End of the proof. Note that by construction

$$M_i^\theta = A_i^\theta + \sum_{j=0}^J B_i^{j,\theta} + \sum_{j=1}^J C_i^{j,\theta}$$

for all $i, \theta$. Collecting the above estimates gives



E^ 



sup l/je <^ max M\ 



6»ee 



<6jV^ + 5oJ32 log 1 + 



No 
P[A] 



16K log 1 + 



No 
P[A] 



+ E- 



+ E {^.^80 log (l + ^) + ^ V a,) log (l + ^ 
We aim to choose $a_j$ such that the $\log(1 + \mathbf{N}_j/\mathbf{P}[A])$ terms disappear. Set

-1/2 



aj = 6j \ — lo. 



1 + ^ 
P[^] 




Then $a_j$ is decreasing with increasing $j$, so $a_{j-1} \vee a_j = a_{j-1}$ and



suplM<RmaxM! 



6»ee 



i<n 



< 6jV^+16K log (^1 + ^) + IGpJj^log (^1 + 



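We now estimate as follows:

$$\sum_{j=0}^J \delta_j\sqrt{\log\left(1 + \frac{\mathbf{N}_j}{\mathbf{P}[A]}\right)} \le \sum_{j=0}^J \delta_j\sqrt{\log\left(1 + \frac{1}{\mathbf{P}[A]}\right)} + \sum_{j=0}^J \delta_j\sum_{p=0}^j \sqrt{\log N_p}$$

and

$$\sum_{j=0}^J \delta_j\sum_{p=0}^j \sqrt{\log N_p} \le \sum_{p=0}^\infty \sqrt{\log N_p}\sum_{j=p}^\infty \delta_j = 4\sum_{p=0}^\infty (\delta_p - \delta_{p+1})\sqrt{\log N_p} \le 4\int_0^{\sqrt{R}} \sqrt{\log \mathcal{N}(n, \Theta, F, K, u)}\,du.$$

We obtain

$$\mathbf{E}^A\left[\sup_{\theta\in\Theta} \mathbf{1}_{R_n^\theta\le R}\max_{i\le n} M_i^\theta\right] \le \delta_J\sqrt{n} + \Phi\left(\log\left(1 + \frac{1}{\mathbf{P}[A]}\right)\right).$$

The result follows by letting $J \to \infty$. □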
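The passage from the chained sums to the entropy integral rests on the dyadic structure $\delta_j = 2^{-j}\sqrt{R}$: interchanging the order of summation and using $\sum_{j\ge p} \delta_j = 2\delta_p = 4(\delta_p - \delta_{p+1})$. The sketch below checks the resulting inequality for a finite chain; the entropy values `log_counts` are a made-up nondecreasing sequence chosen only for illustration, not values from the paper.

```python
import math

def dyadic_levels(R, J):
    # delta_j = 2^{-j} sqrt(R) for j = 0, ..., J.
    return [2.0 ** (-j) * math.sqrt(R) for j in range(J + 1)]

def chained_sum(levels, log_counts):
    # sum_j delta_j * sum_{p <= j} sqrt(log N_p)
    total = 0.0
    for j, delta in enumerate(levels):
        total += delta * sum(math.sqrt(log_counts[p]) for p in range(j + 1))
    return total

def interchanged_bound(levels, log_counts):
    # 4 * sum_p (delta_p - delta_{p+1}) sqrt(log N_p), with delta_{p+1} = delta_p / 2.
    return 4.0 * sum((delta - delta / 2.0) * math.sqrt(log_counts[p])
                     for p, delta in enumerate(levels))

R, J = 3.0, 25
levels = dyadic_levels(R, J)
log_counts = [float(p + 1) for p in range(J + 1)]  # hypothetical values of log N_p
print(chained_sum(levels, log_counts), interchanged_bound(levels, log_counts))
```

The bound is then compared with the integral by noting that $\mathcal{N}(n, \Theta, F, K, u) \ge N_p$ for $u \in [\delta_{p+1}, \delta_p]$, since the bracketing number is nonincreasing in $u$.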